Introduction

Overview

For this project, we will follow the DCOVAC process. The process is listed below:

DCOVAC – THE DATA MODELING FRAMEWORK

  • DEFINE the Problem
  • COLLECT the Data from Appropriate Sources
  • ORGANIZE the Data Collected
  • VISUALIZE the Data by Developing Charts
  • ANALYZE the data with Appropriate Statistical Methods
  • COMMUNICATE your Results

The Problem & Data Collection

The Problem

The data used in this analysis come from over 6,000 Airbnb listings in the Seattle metro area. We will examine the variables in the dataset to determine which ones help predict the listed price for a one-night stay.

The Data

This dataset has 6021 rows and 10 variables. For this analysis, we will ignore the room_id variable, which is simply an identifier carried over from the INFO 3100 version of this dataset.

Data Sources

Levine, S. (2018, December 19). Airbnb Listing Data for Seattle and surrounding regions. Retrieved May 20, 2023. https://www.kaggle.com/datasets/shanelev/seattle-airbnb-listings

The Data

VARIABLES TO PREDICT WITH

  • room_id: unique room identifier (left from INFO 3100, will not be used)
  • room_type: what type of Airbnb listing is the room? (entire home/apt, private room, shared room)
  • address: the city within the Seattle metro area that the listing is located within (Seattle, Bellevue, Kirkland, Redmond, Mercer Island)
  • reviews: the number of reviews that have been made on the listing
  • overall_satisfaction: the average satisfaction rating across all reviews on the listing (presumably rounded to the nearest 0.5 star in the dataset)
  • accommodates: the number of people that can stay in the listing
  • bedrooms: the number of bedrooms provided in the listing
  • bathrooms: the number of bathrooms provided in the listing

VARIABLES WE WANT TO PREDICT

  • price: the current price per night given on the listing in question
  • price_categorical: price > 100 coded as 1, cheaper coded as 0

Data

Organize the Data

Organizing data can also include summarizing data values in simple one-way and two-way tables.

    room_id          room_type           address             reviews      
 Min.   :    2318   Length:6021        Length:6021        Min.   :  3.00  
 1st Qu.: 8815638   Class :character   Class :character   1st Qu.: 12.00  
 Median :17556476   Mode  :character   Mode  :character   Median : 34.00  
 Mean   :16096928                                         Mean   : 59.39  
 3rd Qu.:22983501                                         3rd Qu.: 80.00  
 Max.   :30444838                                         Max.   :687.00  
 overall_satisfaction  accommodates       bedrooms       bathrooms    
 Min.   :2.500        Min.   : 1.000   Min.   :0.000   Min.   :0.000  
 1st Qu.:4.500        1st Qu.: 2.000   1st Qu.:1.000   1st Qu.:1.000  
 Median :5.000        Median : 3.000   Median :1.000   Median :1.000  
 Mean   :4.841        Mean   : 3.661   Mean   :1.357   Mean   :1.279  
 3rd Qu.:5.000        3rd Qu.: 4.000   3rd Qu.:2.000   3rd Qu.:1.000  
 Max.   :5.000        Max.   :28.000   Max.   :8.000   Max.   :8.000  
     price        price_categorical
 Min.   :  16.0   Min.   :0.0000   
 1st Qu.:  63.0   1st Qu.:0.0000   
 Median :  85.0   Median :0.0000   
 Mean   : 107.5   Mean   :0.3445   
 3rd Qu.: 125.0   3rd Qu.:1.0000   
 Max.   :1650.0   Max.   :1.0000   

From this summary we can see that the numeric variables span quite different ranges, while room_type and address are text-based categorical variables and therefore have no summary statistics to show. The room_id variable is only an identifier, so its values carry no meaning for the analysis, and it is removed from the data frame from here on.

Transform Variables

In this dataset, price_categorical is a categorical variable that equals 1 if price is above 100 and 0 otherwise. It is converted to a factor for later use, and its distribution is shown below.

price_categorical

Data Visualization #1

Response Variables

price_categorical High(1)/Low(0)

From this graph, we can see that roughly one third of all listings (about 34%) are priced above 100 dollars per night, which is a workable split for the later price_categorical analysis.

Transform Variables

Data Visualization #2

Response Variables

price

The largest concentration of listings falls between $50 and $100 per night, with the $100 to $150 range almost as large. Among the continuous predictors, accommodates and bedrooms show the strongest relationships with price. The price distribution is also right-skewed because of a small number of extremely high prices.

Transform Variables

price Analysis

Predict price per night

For this analysis we will use a Linear Regression Model.

Adjusted R-Squared

45 %

RMSE

61.4

Regression Output

Term                   Estimate  Std. Error  t value  Pr(>|t|)
bedrooms                 31.500       1.475   21.357     0.000
room_typePrivate room   -39.776       2.107  -18.879     0.000
bathrooms                29.890       1.751   17.071     0.000
room_typeShared room    -85.263       7.037  -12.116     0.000
reviews                  -0.111       0.012   -9.401     0.000
overall_satisfaction     16.519       2.849    5.799     0.000
(Intercept)             -46.316      14.281   -3.243     0.001
accommodates              1.563       0.652    2.398     0.017
addressSeattle            5.227       4.123    1.268     0.205
addressMercer Island     12.915      10.737    1.203     0.229
addressRedmond           -6.891       8.005   -0.861     0.389
addressKirkland           2.558       6.320    0.405     0.686

Residual Assumptions Explorations

Analysis Summary

After examining this model, we can see that one variable is not a significant predictor of price: address, the categorical variable for the city within the metro area where the listing is located. None of its levels has a p-value below 0.05, so we prune it to obtain a simpler linear regression.

Predict price per night, final version

For this analysis we will use a pruned Linear Regression Model. We removed the address (city of listing) variable from this model.

Adjusted R-Squared

45 %

RMSE

61.41

Regression Output

Term                   Estimate  Std. Error  t value  Pr(>|t|)
bedrooms                 31.335       1.470   21.319     0.000
room_typePrivate room   -40.405       2.080  -19.424     0.000
bathrooms                29.952       1.750   17.120     0.000
room_typeShared room    -85.170       7.036  -12.104     0.000
reviews                  -0.109       0.012   -9.275     0.000
overall_satisfaction     16.660       2.846    5.854     0.000
(Intercept)             -42.184      13.856   -3.045     0.002
accommodates              1.615       0.651    2.481     0.013

Residual Assumptions Explorations

Analysis Summary

The residual plots for this model reveal some notable features of the price data. The points that rise above the line at the right end of the Q-Q plot most likely reflect the high-price outliers and the resulting right skew of the data. In the Residuals vs Fitted plot, however, the overall pattern suggests that the model fits most listings reasonably well, with only the extreme values seen in the Q-Q plot standing out. Other types of models would likely be needed to further validate the level of accuracy achievable with this dataset.

Removing the predictor that did not help predict price had little to no impact on our fit statistics (adjusted R-squared and RMSE, the root mean squared error).

The following table shows the direction of each predictor's effect on price.

Variable                           Direction
room_type (both compared options)  Decrease
reviews                            Decrease
overall_satisfaction               Increase
accommodates                       Increase
bedrooms                           Increase
bathrooms                          Increase

price_categorical Analysis

Predict price per night

Conclusion

Summary

In conclusion, the predictor variables do only a moderate job of predicting the price per night for an Airbnb listing, whether through the high/low price category (high being above $100 and low being at or below $100) or the actual price. It is likely that either the sample size was too small for clear patterns to develop, or that a prediction of this type is not reliable without overfitting a large number of additional variables for specific situations.

Combining the results of both types of predictor models and only reporting where agreement was found, we can see that as these variables increase they:
Decrease_Price                         Increase_Price
Number of reviews left on the listing  Satisfaction rating of listing
                                       Number of occupants accommodated
                                       Number of bedrooms available
                                       Number of bathrooms available
---
title: "Airbnb Listings in Seattle Analysis"
output: 
  flexdashboard::flex_dashboard:
    vertical_layout: scroll
    source_code: embed
---

```{r setup, include=FALSE, warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(flexdashboard)
library(tidyverse)
library(GGally)
library(caret) #for logistic regression
library(broom) #for tidy() function
```

```{r load_data}
df <- read_csv("Dettmar_INFO3200ProjectData.csv")
```

Introduction {data-orientation=rows}
=======================================================================

Row {data-height=250}
-----------------------------------------------------------------------

### Overview 

For this project, we will follow the DCOVAC process. The process is listed below:

DCOVAC – THE DATA MODELING FRAMEWORK

* DEFINE the Problem
* COLLECT the Data from Appropriate Sources
* ORGANIZE the Data Collected
* VISUALIZE the Data by Developing Charts
* ANALYZE the data with Appropriate Statistical Methods
* COMMUNICATE your Results

Row {data-height=650}
-----------------------------------------------------------------------

### The Problem & Data Collection

#### The Problem
The data used in this analysis come from over 6,000 Airbnb listings in the Seattle metro area. We will examine the variables in the dataset to determine which ones help predict the listed price for a one-night stay.


#### The Data
This dataset has 6021 rows and 10 variables. For this analysis, we will ignore the `room_id` variable, which is simply an identifier carried over from the INFO 3100 version of this dataset.

#### Data Sources
Levine, S. (2018, December 19). Airbnb Listing Data for Seattle and surrounding regions. Retrieved May 20, 2023.
https://www.kaggle.com/datasets/shanelev/seattle-airbnb-listings


### The Data
VARIABLES TO PREDICT WITH

* *room_id*: unique room identifier (left from INFO 3100, will not be used) 
* *room_type*: what type of Airbnb listing is the room? (entire home/apt, private room, shared room)
* *address*: the city within the Seattle metro area that the listing is located within (Seattle, Bellevue, Kirkland, Redmond, Mercer Island) 
* *reviews*: the number of reviews that have been made on the listing
* *overall_satisfaction*: the average satisfaction rating across all reviews on the listing (presumably rounded to the nearest 0.5 star in the dataset)
* *accommodates*: the number of people that can stay in the listing
* *bedrooms*: the number of bedrooms provided in the listing
* *bathrooms*: the number of bathrooms provided in the listing

VARIABLES WE WANT TO PREDICT

* *price*: the current price per night given on the listing in question
* *price_categorical*: price > 100 coded as 1, cheaper coded as 0

Data
=======================================================================


Column {data-width=650}
-----------------------------------------------------------------------
### Organize the Data
Organizing data can also include summarizing data values in simple one-way and two-way tables.

```{r, cache=TRUE}
#the cache=TRUE can be removed. This will allow you to rerun your code without it having to run EVERYTHING from scratch every time. If the output seems to not reflect new updates, you can choose Knit, Clear Knitr cache to fix.

#Clean column names: make.names() replaces invalid characters (such as spaces) with periods
colnames(df) <- make.names(colnames(df))
#View summary statistics for every column
summary(df)
#remove room_id because it is an identifier, not a real continuous number
df <- select(df,-room_id)
```
From this summary we can see that the numeric variables span quite different ranges, while room_type and address are text-based categorical variables and therefore have no summary statistics to show. The room_id variable is only an identifier, so its values carry no meaning for the analysis, and it is removed from the data frame from here on.
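
The one-way and two-way tables mentioned above are not produced by the chunk itself. As a minimal sketch, assuming a small made-up data frame (`listings` and its values are hypothetical and not taken from the dataset), base R's `table()` can build both:

```r
# Hypothetical miniature of the listings data; values are made up for illustration.
listings <- data.frame(
  room_type = c("Entire home/apt", "Private room", "Private room",
                "Shared room", "Entire home/apt", "Entire home/apt"),
  address   = c("Seattle", "Seattle", "Bellevue",
                "Seattle", "Kirkland", "Seattle")
)

# One-way table: number of listings per room type.
one_way <- table(listings$room_type)
print(one_way)

# Two-way table: room type (rows) by city (columns).
two_way <- table(listings$room_type, listings$address)
print(two_way)
```

The same calls should work on the real data frame by swapping `listings` for `df`.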

Column {data-width=350}
-----------------------------------------------------------------------
### Transform Variables
In this dataset, price_categorical is a categorical variable that equals 1 if price is above 100 and 0 otherwise. The code below converts it to a factor for later use, and the figure that follows shows its distribution.
```{r, cache=TRUE}
df <- mutate(df, price_categorical = as.factor(price_categorical))
```
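
For readers who want to see how such a variable could be derived and summarized, here is a minimal sketch on made-up prices (the `price` vector is hypothetical). It assumes the coding rule stated above, with prices strictly above 100 coded as 1:

```r
# Hypothetical nightly prices; not taken from the dataset.
price <- c(45, 100, 101, 250, 80, 99)

# Recreate the coding described above: 1 if price > 100, otherwise 0.
price_categorical <- as.factor(as.integer(price > 100))

# Counts and proportions per class.
class_counts <- table(price_categorical)
class_share  <- prop.table(class_counts)
print(class_counts)
print(round(class_share, 2))
```

Note that a price of exactly 100 falls in the low (0) class under this rule.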
#### price_categorical

![](PriceCatDist.png)


Data Visualization #1
=======================================================================


Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables
#### price_categorical High(1)/Low(0)
```{r, cache=TRUE}
as_tibble(select(df, price_categorical) %>% table()) %>% 
  ggplot(aes(y = n, x = price_categorical)) + geom_bar(stat="identity")
```

From this graph, we can see that roughly one third of all listings (about 34%) are priced above 100 dollars per night, which is a workable split for the later price_categorical analysis.


Column {data-width=500}
-----------------------------------------------------------------------

### Transform Variables

```{r, cache=TRUE}
ggpairs(select(df, price_categorical, reviews, overall_satisfaction, accommodates, bedrooms, bathrooms))
```


Data Visualization #2
=======================================================================


Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables

#### price
```{r, cache=TRUE}
ggplot(df, aes(price)) + geom_histogram(bins=33)
```

The largest concentration of listings falls between $50 and $100 per night, with the $100 to $150 range almost as large. Among the continuous predictors, accommodates and bedrooms show the strongest relationships with price. The price distribution is also right-skewed because of a small number of extremely high prices.
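
The strength of those relationships can also be checked numerically. The sketch below is only an illustration on simulated data (`sim` and its columns are hypothetical, and price is built to depend on bedrooms), showing how the correlation of price with each numeric predictor might be ranked:

```r
# Hypothetical data: price is constructed to depend on bedrooms, so the numbers are illustrative only.
set.seed(1)
n <- 200
bedrooms     <- sample(0:4, n, replace = TRUE)
accommodates <- bedrooms * 2 + sample(1:2, n, replace = TRUE)
reviews      <- rpois(n, 40)
price        <- 50 + 30 * bedrooms + rnorm(n, sd = 20)
sim <- data.frame(price, bedrooms, accommodates, reviews)

# Pearson correlation of price with each numeric predictor (drop price vs. itself).
price_cor <- cor(sim)["price", -1]

# Rank predictors by absolute correlation, strongest first.
print(round(sort(abs(price_cor), decreasing = TRUE), 2))
```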


Column {data-width=500}
-----------------------------------------------------------------------

### Transform Variables

```{r, cache=TRUE}
ggpairs(select(df, price, reviews, overall_satisfaction, accommodates, bedrooms, bathrooms))
```


price Analysis {data-orientation=rows}
=======================================================================

Row
-----------------------------------------------------------------------

### Predict price per night
For this analysis we will use a Linear Regression Model.

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
price_lm <- lm(price ~ . -price_categorical, data = df)
summary(price_lm)
```

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(price_lm)
```

### Adjusted R-Squared

```{r, cache=TRUE}
ARSq<-round(summary(price_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```

### RMSE

```{r, cache=TRUE}
Sig<-round(summary(price_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```

Row
-----------------------------------------------------------------------

### Regression Output

```{r,include=FALSE, cache=TRUE}
#knitr::kable(summary(price_lm)$coef, digits = 3) #pretty table output
summary(price_lm)$coef
```

```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(price_lm))[,4])  
out <- coef(summary(price_lm))[idx,] 
knitr::kable(out, digits = 3) #pretty table output
```

### Residual Assumptions Explorations

```{r, cache=TRUE}
plot(price_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```

Row
-----------------------------------------------------------------------

### Analysis Summary
After examining this model, we can see that one variable is not a significant predictor of price: address, the categorical variable for the city within the metro area where the listing is located. None of its levels has a p-value below 0.05, so we prune it to obtain a simpler linear regression.
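
Because address enters the model as several dummy variables, one way to judge it as a whole is a partial F-test between the models with and without it. This is a minimal sketch on simulated data, not the analysis used in this dashboard; `sim`, `city`, `full`, and `reduced` are hypothetical names:

```r
# Hypothetical data: 'city' has no real effect on 'price' here by construction.
set.seed(7)
n <- 300
sim <- data.frame(
  bedrooms = sample(0:4, n, replace = TRUE),
  city     = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
sim$price <- 60 + 30 * sim$bedrooms + rnorm(n, sd = 25)

full    <- lm(price ~ bedrooms + city, data = sim)
reduced <- lm(price ~ bedrooms, data = sim)

# Partial F-test: do the city dummies jointly improve the fit?
f_test <- anova(reduced, full)
print(f_test)
```

A large p-value in the second row would support dropping the factor, which is the same conclusion the individual coefficient p-values point to here.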

Row
-----------------------------------------------------------------------

### Predict price per night, final version
For this analysis we will use a pruned Linear Regression Model. We removed the address (city of listing) variable from this model.

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
price_lm <- lm(price ~ . -price_categorical -address, data = df)
summary(price_lm)
```

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(price_lm)
```

### Adjusted R-Squared

```{r, cache=TRUE}
ARSq<-round(summary(price_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```

### RMSE

```{r, cache=TRUE}
Sig<-round(summary(price_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```

Row
-----------------------------------------------------------------------

### Regression Output

```{r, include=FALSE, cache=TRUE}
knitr::kable(summary(price_lm)$coef, digits = 3) #pretty table output
```

```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(price_lm))[,4])  
out <- coef(summary(price_lm))[idx,] 
knitr::kable(out, digits = 3) #pretty table output
```

### Residual Assumptions Explorations

```{r, cache=TRUE}
plot(price_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```


Row
-----------------------------------------------------------------------

### Analysis Summary
The residual plots for this model reveal some notable features of the price data. The points that rise above the line at the right end of the Q-Q plot most likely reflect the high-price outliers and the resulting right skew of the data. In the Residuals vs Fitted plot, however, the overall pattern suggests that the model fits most listings reasonably well, with only the extreme values seen in the Q-Q plot standing out. Other types of models would likely be needed to further validate the level of accuracy achievable with this dataset.
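
One possible follow-up, offered only as an assumption and not something this analysis carried out, is to model the logarithm of price, which often reduces right skew in the residuals. The sketch below uses simulated log-normal prices (all names and values are hypothetical) to compare residual skewness for the two response scales:

```r
# Hypothetical right-skewed prices (log-normal noise); illustrative only.
set.seed(3)
n <- 500
bedrooms <- sample(0:4, n, replace = TRUE)
price <- exp(4 + 0.3 * bedrooms + rnorm(n, sd = 0.5))

raw_fit <- lm(price ~ bedrooms)
log_fit <- lm(log(price) ~ bedrooms)

# Sample skewness of residuals: values near 0 indicate a roughly symmetric distribution.
skewness <- function(r) mean((r - mean(r))^3) / sd(r)^3
skew_raw <- skewness(residuals(raw_fit))
skew_log <- skewness(residuals(log_fit))
print(c(raw = skew_raw, log = skew_log))
```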

Removing the predictor that did not help predict price had little to no impact on our fit statistics (adjusted R-squared and RMSE, the root mean squared error).
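
The RMSE value boxes in this dashboard are filled from `summary(price_lm)$sigma`, which is the residual standard error rather than a strict root mean squared error. The sketch below, on made-up data (`x`, `y`, and `fit` are hypothetical), shows how the two relate; with roughly 6,000 rows and fewer than a dozen coefficients the difference is negligible.

```r
# Hypothetical data for illustration only.
set.seed(42)
n <- 100
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 4)
fit <- lm(y ~ x)

# Residual standard error: sqrt(RSS / (n - p)), which is what summary()$sigma returns.
sigma_hat <- summary(fit)$sigma

# In-sample RMSE: sqrt(RSS / n), with no degrees-of-freedom correction.
rmse <- sqrt(mean(residuals(fit)^2))

# The two are linked by a factor of sqrt((n - p) / n), where p counts the coefficients.
p <- length(coef(fit))
print(c(sigma = sigma_hat, rmse = rmse))
```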

The following table shows the direction of each predictor's effect on price.

```{r, cache=TRUE}
#create table summary of predictor changes
predchang = tibble(
  Variable = c('room_type (both compared options)', 'reviews', 'overall_satisfaction','accommodates','bedrooms','bathrooms'),
  Direction = c('Decrease', 'Decrease', 'Increase', 'Increase','Increase','Increase')
)
knitr::kable(predchang) #pretty table output

```




price_categorical Analysis {data-orientation=rows}
=======================================================================

Row {data-height=900}
-----------------------------------------------------------------------

### Predict price per night
![](PriceCatAnalysis.png)
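
The analysis on this page is included only as an image, so the exact model behind it is not visible in this source. As a rough sketch, assuming a logistic regression (the setup chunk loads caret "for logistic regression") and using base R's `glm()` on simulated data with hypothetical names, a high/low price classifier could look like this:

```r
# Hypothetical data: the probability of a high price rises with bedrooms by construction.
set.seed(11)
n <- 400
bedrooms <- sample(0:4, n, replace = TRUE)
reviews  <- rpois(n, 40)
high     <- rbinom(n, 1, plogis(-2 + 1.2 * bedrooms))
sim <- data.frame(price_categorical = factor(high), bedrooms, reviews)

# Logistic regression: models the log-odds of the "1" (high price) class.
logit_fit <- glm(price_categorical ~ bedrooms + reviews,
                 data = sim, family = binomial)

# Classify at a 0.5 probability cutoff and compute in-sample accuracy.
pred_class <- as.integer(fitted(logit_fit) > 0.5)
accuracy   <- mean(pred_class == high)
print(round(coef(logit_fit), 2))
print(accuracy)
```

A positive coefficient means the predictor raises the odds of a listing being priced above $100, which is how the direction of each effect can be compared with the linear model.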


Conclusion
=======================================================================
### Summary

In conclusion, the predictor variables do only a moderate job of predicting the price per night for an Airbnb listing, whether through the high/low price category (high being above $100 and low being at or below $100) or the actual price. It is likely that either the sample size was too small for clear patterns to develop, or that a prediction of this type is not reliable without overfitting a large number of additional variables for specific situations.

Combining the results of both types of predictor models and only reporting where agreement was found, we can see that as these variables increase they:
```{r}
#final table summary of predictor changes
predchangfnl = tibble(Decrease_Price = 
                            c("Number of reviews left on the listing",
                              "",
                              "",
                              ""),
                    Increase_Price = c("Satisfaction rating of listing",
                                       "Number of occupants accommodated",
                                       "Number of bedrooms available",
                                       "Number of bathrooms available"))  
knitr::kable(predchangfnl) #pretty table output
```